In [1]:
!pip install h2o
Collecting h2o
Downloading h2o-3.44.0.3.tar.gz (265.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 265.2/265.2 MB 2.1 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from h2o) (2.31.0)
Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from h2o) (0.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (2024.2.2)
Building wheels for collected packages: h2o
Building wheel for h2o (setup.py) ... done
Created wheel for h2o: filename=h2o-3.44.0.3-py2.py3-none-any.whl size=265293968 sha256=d460947e56fc52a602a119df202d5b640c5dc7e06e91ae15408ca055178ada9c
Stored in directory: /root/.cache/pip/wheels/77/9a/1c/2da26f943fd46b57f3c20b54847b936b9152b831dc7447cf71
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.44.0.3
In [2]:
import h2o
from h2o.automl import H2OAutoML
In [3]:
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321..... not found. Attempting to start a local H2O server... Java Version: openjdk version "11.0.21" 2023-10-17; OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu122.04); OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu122.04, mixed mode, sharing) Starting server from /usr/local/lib/python3.10/dist-packages/h2o/backend/bin/h2o.jar Ice root: /tmp/tmpt5itl_om JVM stdout: /tmp/tmpt5itl_om/h2o_unknownUser_started_from_python.out JVM stderr: /tmp/tmpt5itl_om/h2o_unknownUser_started_from_python.err Server is running at http://127.0.0.1:54321 Connecting to H2O server at http://127.0.0.1:54321 ... successful.
| H2O_cluster_uptime: | 07 secs |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.44.0.3 |
| H2O_cluster_version_age: | 1 month and 27 days |
| H2O_cluster_name: | H2O_from_python_unknownUser_wuxgfb |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.170 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://127.0.0.1:54321 |
| H2O_connection_proxy: | {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"} |
| H2O_internal_security: | False |
| Python_version: | 3.10.12 final |
importing data¶
In [4]:
day = h2o.import_file("day.csv")
hour = h2o.import_file("hour.csv")
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [5]:
#target column
y = "cnt"
Split into train & test¶
In [6]:
# splitting day dataset
splits_day = day.split_frame(ratios = [0.8], seed = 1)
train_day = splits_day[0]
test_day = splits_day[1]
#splitting hour dataset
splits_hour = day.split_frame(ratios = [0.8], seed = 1)
train_hour = splits_hour[0]
test_hour = splits_hour[1]
In [7]:
# Run AutoML for 10 minutes
aml = H2OAutoML(max_runtime_secs=600, seed=1)
aml.train(y=y, training_frame=train_day)
AutoML progress: | 14:44:03.745: _train param, Dropping bad and constant columns: [dteday] ██ 14:44:10.148: _train param, Dropping bad and constant columns: [dteday] 14:44:11.932: _train param, Dropping bad and constant columns: [dteday] █ 14:44:16.687: _train param, Dropping unused columns: [dteday] 14:44:17.0: _train param, Dropping bad and constant columns: [dteday] ██ 14:44:21.82: _train param, Dropping bad and constant columns: [dteday] █ 14:44:26.856: _train param, Dropping bad and constant columns: [dteday] █ 14:44:30.329: _train param, Dropping bad and constant columns: [dteday] █ 14:44:33.575: _train param, Dropping bad and constant columns: [dteday] █ 14:44:37.405: _train param, Dropping unused columns: [dteday] 14:44:37.702: _train param, Dropping unused columns: [dteday] 14:44:38.32: _train param, Dropping bad and constant columns: [dteday] ██ 14:44:42.430: _train param, Dropping bad and constant columns: [dteday] ██ 14:44:47.736: _train param, Dropping bad and constant columns: [dteday] █ 14:44:50.506: _train param, Dropping bad and constant columns: [dteday] 14:44:52.607: _train param, Dropping unused columns: [dteday] 14:44:52.878: _train param, Dropping unused columns: [dteday] ████████████████████████████████████████████ 14:53:22.69: _train param, Dropping unused columns: [dteday] 14:53:22.448: _train param, Dropping unused columns: [dteday] ████ 14:53:57.650: _train param, Dropping unused columns: [dteday] █| (done) 100% 14:54:00.760: _train param, Dropping bad and constant columns: [dteday] 14:54:01.788: _train param, Dropping unused columns: [dteday] 14:54:03.207: _train param, Dropping unused columns: [dteday]
Out[7]:
Model Details ============= H2OStackedEnsembleEstimator : Stacked Ensemble Model Key: StackedEnsemble_BestOfFamily_4_AutoML_1_20240217_144403
| key | value |
|---|---|
| Stacking strategy | cross_validation |
| Number of base models (used / total) | 4/6 |
| # GBM base models (used / total) | 1/1 |
| # XGBoost base models (used / total) | 1/1 |
| # DRF base models (used / total) | 0/2 |
| # GLM base models (used / total) | 1/1 |
| # DeepLearning base models (used / total) | 1/1 |
| Metalearner algorithm | GLM |
| Metalearner fold assignment scheme | Random |
| Metalearner nfolds | 5 |
| Metalearner fold_column | None |
| Custom metalearner hyperparameters | None |
ModelMetricsRegressionGLM: stackedensemble ** Reported on train data. ** MSE: 486.75071155330653 RMSE: 22.062427598822993 MAE: 17.055120273597556 RMSLE: 0.06120432078413894 Mean Residual Deviance: 486.75071155330653 R^2: 0.9998720371501463 Null degrees of freedom: 581 Residual degrees of freedom: 577 Null deviance: 2213837175.773192 Residual deviance: 283288.9141240244 AIC: 5264.916178398554
ModelMetricsRegressionGLM: stackedensemble ** Reported on cross-validation data. ** MSE: 15312.971036722769 RMSE: 123.74558996878542 MAE: 77.3034660817466 RMSLE: 0.1392697751362131 Mean Residual Deviance: 15312.971036722769 R^2: 0.9959743429910287 Null degrees of freedom: 581 Residual degrees of freedom: 576 Null deviance: 2226303681.5212297 Residual deviance: 8912149.143372651 AIC: 7274.061570176624
| mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | |
|---|---|---|---|---|---|---|---|
| mae | 77.20995 | 5.3290925 | 82.580605 | 77.97429 | 77.375755 | 79.72927 | 68.38984 |
| mean_residual_deviance | 15284.537 | 3083.0703 | 15855.276 | 14955.0625000 | 16031.55 | 19064.777 | 10516.0205000 |
| mse | 15284.537 | 3083.0703 | 15855.276 | 14955.0625000 | 16031.55 | 19064.777 | 10516.0205000 |
| null_deviance | 445260736.0000000 | 39754988.0000000 | 513840064.0000000 | 416552128.0000000 | 418845568.0000000 | 440299872.0000000 | 436766048.0000000 |
| r2 | 0.9959117 | 0.0009394 | 0.9963571 | 0.9954821 | 0.9957655 | 0.9947284 | 0.9972252 |
| residual_deviance | 1782429.9 | 381501.3 | 1870922.6 | 1869382.8 | 1763470.5 | 2230579.0 | 1177794.4 |
| rmse | 123.08946 | 12.919091 | 125.91774 | 122.29089 | 126.61575 | 138.07526 | 102.54765 |
| rmsle | 0.0905134 | 0.1117076 | 0.0411183 | 0.2903026 | 0.0433797 | 0.0405938 | 0.0371726 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [8]:
# Explain leader model & compare with all AutoML models
exa = aml.explain(test_day)
Leaderboard
Leaderboard shows models with their metrics. When provided with H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings), otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
| model_id | rmse | mse | mae | rmsle | mean_residual_deviance | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| GBM_grid_1_AutoML_1_20240217_144403_model_59 | 93.1523 | 8677.35 | 66.9168 | 0.0309652 | 8677.35 | 391 | 0.037761 | GBM |
| GBM_grid_1_AutoML_1_20240217_144403_model_12 | 102.693 | 10545.8 | 69.0987 | 0.0259967 | 10545.8 | 529 | 0.032917 | GBM |
| GBM_grid_1_AutoML_1_20240217_144403_model_51 | 103.076 | 10624.7 | 63.5736 | 0.0364295 | 10624.7 | 401 | 0.032205 | GBM |
| GBM_grid_1_AutoML_1_20240217_144403_model_43 | 103.456 | 10703 | 74.2977 | 0.0345611 | 10703 | 399 | 0.032722 | GBM |
| GBM_grid_1_AutoML_1_20240217_144403_model_50 | 108.278 | 11724.1 | 70.7343 | 0.0353118 | 11724.1 | 381 | 0.030051 | GBM |
| GBM_grid_1_AutoML_1_20240217_144403_model_5 | 110.504 | 12211.1 | 75.2453 | 0.0338933 | 12211.1 | 484 | 0.03774 | GBM |
| StackedEnsemble_AllModels_2_AutoML_1_20240217_144403 | 116.541 | 13581.7 | 81.6209 | 0.0370405 | 13581.7 | 284 | 0.078936 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_4_AutoML_1_20240217_144403 | 116.817 | 13646.2 | 76.5354 | 0.0298819 | 13646.2 | 281 | 0.103605 | StackedEnsemble |
| GBM_grid_1_AutoML_1_20240217_144403_model_44 | 117.215 | 13739.4 | 74.4341 | 0.0385557 | 13739.4 | 381 | 0.039066 | GBM |
| GBM_grid_1_AutoML_1_20240217_144403_model_29 | 117.541 | 13815.9 | 84.0961 | 0.0479065 | 13815.9 | 340 | 0.03373 | GBM |
| GBM_grid_1_AutoML_1_20240217_144403_model_2 | 118.505 | 14043.5 | 82.3569 | 0.0352881 | 14043.5 | 286 | 0.031749 | GBM |
| XGBoost_grid_1_AutoML_1_20240217_144403_model_48 | 118.583 | 14062 | 87.7628 | 0.0312422 | 14062 | 305 | 0.01148 | XGBoost |
| GBM_grid_1_AutoML_1_20240217_144403_model_42 | 118.592 | 14064.1 | 82.0764 | 0.0408402 | 14064.1 | 289 | 0.030322 | GBM |
| StackedEnsemble_BestOfFamily_5_AutoML_1_20240217_144403 | 119.841 | 14361.9 | 84.2925 | 0.0315067 | 14361.9 | 1380 | 0.097068 | StackedEnsemble |
| StackedEnsemble_AllModels_1_AutoML_1_20240217_144403 | 120.772 | 14585.8 | 85.5559 | 0.0387308 | 14585.8 | 318 | 0.110122 | StackedEnsemble |
| GBM_grid_1_AutoML_1_20240217_144403_model_47 | 123.506 | 15253.7 | 81.6336 | 0.0408862 | 15253.7 | 696 | 0.047935 | GBM |
| GBM_3_AutoML_1_20240217_144403 | 124.079 | 15395.5 | 89.441 | 0.0402024 | 15395.5 | 483 | 0.031665 | GBM |
| StackedEnsemble_BestOfFamily_3_AutoML_1_20240217_144403 | 124.561 | 15515.5 | 87.7307 | 0.0417306 | 15515.5 | 256 | 0.057956 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_1_20240217_144403 | 124.879 | 15594.7 | 88.3318 | 0.0405229 | 15594.7 | 277 | 0.100871 | StackedEnsemble |
| XGBoost_grid_1_AutoML_1_20240217_144403_model_55 | 125.334 | 15708.5 | 94.8567 | 0.0368063 | 15708.5 | 525 | 0.012201 | XGBoost |
[20 rows x 9 columns]
Residual Analysis
Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Variable Importance
The variable importance plot shows the relative importance of the most important variables in the model.
Variable Importance Heatmap
Variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types we compute a summarization of the the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
Model Correlation
This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.
SHAP Summary
SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Individual Conditional Expectation
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
In [9]:
# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test_day)
Residual Analysis
Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Individual Conditional Expectation
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
In [10]:
pred = aml.predict(test_day)
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
In [11]:
pred
Out[11]:
| predict |
|---|
| 1354.34 |
| 1362.21 |
| 1481.46 |
| 1711.95 |
| 1600.66 |
| 1034.18 |
| 1426.78 |
| 2049.42 |
| 659.64 |
| 2016.87 |
[149 rows x 1 column]
model 2¶
In [12]:
from h2o.estimators import H2ORandomForestEstimator
In [13]:
day_drf = H2ORandomForestEstimator()
day_drf.train(y=y, training_frame=train_day)
drf Model Build progress: |
/usr/local/lib/python3.10/dist-packages/h2o/estimators/estimator_base.py:192: RuntimeWarning: Dropping bad and constant columns: [dteday] warnings.warn(mesg["message"], RuntimeWarning)
██████████████████████████████████████████████████████| (done) 100%
Out[13]:
Model Details ============= H2ORandomForestEstimator : Distributed Random Forest Model Key: DRF_model_python_1708181008418_30
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 50.0 | 50.0 | 229884.0 | 15.0 | 20.0 | 17.4 | 343.0 | 382.0 | 361.56 |
ModelMetricsRegression: drf ** Reported on train data. ** MSE: 81508.73927224868 RMSE: 285.4973542298574 MAE: 190.0332934697426 RMSLE: 0.22197021100068237 Mean Residual Deviance: 81508.73927224868
| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | |
|---|---|---|---|---|---|---|
| 2024-02-17 15:02:21 | 0.009 sec | 0.0 | nan | nan | nan | |
| 2024-02-17 15:02:21 | 0.043 sec | 1.0 | 588.7176019 | 394.9585253 | 346588.4147465 | |
| 2024-02-17 15:02:21 | 0.053 sec | 2.0 | 605.4378599 | 394.8101449 | 366555.0021739 | |
| 2024-02-17 15:02:21 | 0.067 sec | 3.0 | 550.2943324 | 357.3465517 | 302823.8522190 | |
| 2024-02-17 15:02:21 | 0.084 sec | 4.0 | 505.4228509 | 330.8135828 | 255452.2582583 | |
| 2024-02-17 15:02:21 | 0.098 sec | 5.0 | 499.2198643 | 325.2633077 | 249220.4729553 | |
| 2024-02-17 15:02:21 | 0.110 sec | 6.0 | 506.9312072 | 328.5808380 | 256979.2488527 | |
| 2024-02-17 15:02:21 | 0.123 sec | 7.0 | 488.0326266 | 328.0297450 | 238175.8445900 | |
| 2024-02-17 15:02:21 | 0.135 sec | 8.0 | 461.5549127 | 310.0436004 | 213032.9374466 | |
| 2024-02-17 15:02:21 | 0.147 sec | 9.0 | 441.3220903 | 296.5101315 | 194765.1874102 | |
| --- | --- | --- | --- | --- | --- | --- |
| 2024-02-17 15:02:21 | 0.599 sec | 41.0 | 280.7523216 | 190.1656341 | 78821.8661065 | |
| 2024-02-17 15:02:21 | 0.621 sec | 42.0 | 278.0150279 | 187.5232346 | 77292.3557641 | |
| 2024-02-17 15:02:21 | 0.634 sec | 43.0 | 277.5362362 | 186.8553769 | 77026.3624273 | |
| 2024-02-17 15:02:21 | 0.652 sec | 44.0 | 277.3277469 | 187.0116936 | 76910.6792138 | |
| 2024-02-17 15:02:21 | 0.675 sec | 45.0 | 278.1648407 | 187.6010672 | 77375.6786035 | |
| 2024-02-17 15:02:22 | 0.696 sec | 46.0 | 279.2733646 | 188.3932677 | 77993.6121641 | |
| 2024-02-17 15:02:22 | 0.709 sec | 47.0 | 278.8433704 | 187.7151335 | 77753.6252406 | |
| 2024-02-17 15:02:22 | 0.722 sec | 48.0 | 278.6353297 | 186.3397700 | 77637.6469438 | |
| 2024-02-17 15:02:22 | 0.740 sec | 49.0 | 282.1794580 | 188.8148377 | 79625.2465064 | |
| 2024-02-17 15:02:22 | 0.752 sec | 50.0 | 285.4973542 | 190.0332935 | 81508.7392722 |
[51 rows x 7 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| registered | 37045891072.0000000 | 1.0 | 0.4211622 |
| instant | 13731277824.0000000 | 0.3706559 | 0.1561062 |
| casual | 13455325184.0000000 | 0.3632070 | 0.1529690 |
| temp | 9343237120.0000000 | 0.2522071 | 0.1062201 |
| yr | 3785519872.0000000 | 0.1021846 | 0.0430363 |
| atemp | 3496740096.0000000 | 0.0943894 | 0.0397533 |
| season | 2091005440.0000000 | 0.0564437 | 0.0237719 |
| mnth | 1695710464.0000000 | 0.0457732 | 0.0192780 |
| hum | 983589632.0000000 | 0.0265506 | 0.0111821 |
| weathersit | 734234368.0000000 | 0.0198196 | 0.0083473 |
| workingday | 573922432.0000000 | 0.0154922 | 0.0065247 |
| windspeed | 508986432.0000000 | 0.0137393 | 0.0057865 |
| weekday | 472511264.0000000 | 0.0127548 | 0.0053718 |
| holiday | 43159148.0000000 | 0.0011650 | 0.0004907 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [14]:
exp = day_drf.explain(test_day)
Residual Analysis
Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Variable Importance
The variable importance plot shows the relative importance of the most important variables in the model.
SHAP Summary
SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Individual Conditional Expectation
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
In [15]:
perf = day_drf.model_performance()
In [16]:
perf
Out[16]:
ModelMetricsRegression: drf ** Reported on train data. ** MSE: 81508.73927224868 RMSE: 285.4973542298574 MAE: 190.0332934697426 RMSLE: 0.22197021100068237 Mean Residual Deviance: 81508.73927224868
In [17]:
pred1 = day_drf.predict(test_day)
pred1
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[17]:
| predict |
|---|
| 1368.32 |
| 1354.18 |
| 1480.34 |
| 1758.58 |
| 1589.86 |
| 1426.08 |
| 1525.5 |
| 2053.12 |
| 1108.74 |
| 2030.34 |
[149 rows x 1 column]
testing model on hour dataset¶
In [18]:
# Run AutoML for 10 minutes
aml1 = H2OAutoML(max_runtime_secs=600, seed=1)
aml1.train(y=y, training_frame=train_hour)
AutoML progress: | 15:03:33.9: _train param, Dropping bad and constant columns: [dteday] █ 15:03:36.836: _train param, Dropping bad and constant columns: [dteday] 15:03:37.190: _train param, Dropping bad and constant columns: [dteday] █ 15:03:38.880: _train param, Dropping unused columns: [dteday] 15:03:39.13: _train param, Dropping bad and constant columns: [dteday] ██ 15:03:41.688: _train param, Dropping bad and constant columns: [dteday] ██ 15:03:44.442: _train param, Dropping bad and constant columns: [dteday] ██ 15:03:48.679: _train param, Dropping bad and constant columns: [dteday] █ 15:03:53.487: _train param, Dropping bad and constant columns: [dteday] █ 15:03:55.955: _train param, Dropping unused columns: [dteday] 15:03:56.90: _train param, Dropping unused columns: [dteday] 15:03:56.251: _train param, Dropping bad and constant columns: [dteday] █ 15:03:58.276: _train param, Dropping bad and constant columns: [dteday] ██ 15:04:02.12: _train param, Dropping bad and constant columns: [dteday] █ 15:04:05.460: _train param, Dropping bad and constant columns: [dteday] █ 15:04:06.186: _train param, Dropping unused columns: [dteday] 15:04:06.325: _train param, Dropping unused columns: [dteday] ███████████████████████████████████████████ 15:12:50.185: _train param, Dropping unused columns: [dteday] 15:12:50.356: _train param, Dropping unused columns: [dteday] █████| (done) 100% 15:13:28.160: _train param, Dropping unused columns: [dteday] 15:13:30.313: _train param, Dropping bad and constant columns: [dteday] 15:13:32.329: _train param, Dropping unused columns: [dteday]
Out[18]:
Model Details ============= H2OStackedEnsembleEstimator : Stacked Ensemble Model Key: StackedEnsemble_BestOfFamily_4_AutoML_2_20240217_150332
| key | value |
|---|---|
| Stacking strategy | cross_validation |
| Number of base models (used / total) | 4/6 |
| # GBM base models (used / total) | 1/1 |
| # XGBoost base models (used / total) | 1/1 |
| # DRF base models (used / total) | 0/2 |
| # GLM base models (used / total) | 1/1 |
| # DeepLearning base models (used / total) | 1/1 |
| Metalearner algorithm | GLM |
| Metalearner fold assignment scheme | Random |
| Metalearner nfolds | 5 |
| Metalearner fold_column | None |
| Custom metalearner hyperparameters | None |
ModelMetricsRegressionGLM: stackedensemble ** Reported on train data. ** MSE: 620.9296127118674 RMSE: 24.918459276445393 MAE: 19.01999641086325 RMSLE: 0.0724438562484004 Mean Residual Deviance: 620.9296127118674 R^2: 0.9998367625954822 Null degrees of freedom: 581 Residual degrees of freedom: 577 Null deviance: 2213837175.773192 Residual deviance: 361381.03459830687 AIC: 5406.61317176631
ModelMetricsRegressionGLM: stackedensemble ** Reported on cross-validation data. ** MSE: 15159.64352909012 RMSE: 123.12450417804784 MAE: 77.76431753872633 RMSLE: 0.1426861714671551 Mean Residual Deviance: 15159.64352909012 R^2: 0.9960146515604297 Null degrees of freedom: 581 Residual degrees of freedom: 577 Null deviance: 2226303681.5212297 Residual deviance: 8822912.53393045 AIC: 7266.204681044688
| mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | |
|---|---|---|---|---|---|---|---|
| mae | 77.69505 | 3.7474957 | 83.27274 | 77.30204 | 75.21033 | 79.091995 | 73.59817 |
| mean_residual_deviance | 15132.089 | 2163.0293 | 16115.072 | 14761.508 | 14749.066 | 17970.209 | 12064.588 |
| mse | 15132.089 | 2163.0293 | 16115.072 | 14761.508 | 14749.066 | 17970.209 | 12064.588 |
| null_deviance | 445260736.0000000 | 39754988.0000000 | 513840064.0000000 | 416552128.0000000 | 418845568.0000000 | 440299872.0000000 | 436766048.0000000 |
| r2 | 0.995958 | 0.0006905 | 0.9962975 | 0.9955406 | 0.9961043 | 0.9950311 | 0.9968165 |
| residual_deviance | 1764582.5 | 287461.72 | 1901578.5 | 1845188.4 | 1622397.2 | 2102514.5 | 1351233.9 |
| rmse | 122.75595 | 8.878689 | 126.94515 | 121.49694 | 121.44573 | 134.05301 | 109.83892 |
| rmsle | 0.0922739 | 0.1148774 | 0.0449698 | 0.2977287 | 0.0401578 | 0.0389994 | 0.0395137 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [19]:
# Explain leader model & compare with all AutoML models
exa1 = aml1.explain(test_hour)
Leaderboard
Leaderboard shows models with their metrics. When provided with H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings), otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
| model_id | rmse | mse | mae | rmsle | mean_residual_deviance | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| GBM_grid_1_AutoML_2_20240217_150332_model_59 | 93.1523 | 8677.35 | 66.9168 | 0.0309652 | 8677.35 | 379 | 0.026677 | GBM |
| GBM_grid_1_AutoML_2_20240217_150332_model_12 | 102.693 | 10545.8 | 69.0987 | 0.0259967 | 10545.8 | 298 | 0.023362 | GBM |
| GBM_grid_1_AutoML_2_20240217_150332_model_51 | 103.076 | 10624.7 | 63.5736 | 0.0364295 | 10624.7 | 420 | 0.03555 | GBM |
| GBM_grid_1_AutoML_2_20240217_150332_model_43 | 103.456 | 10703 | 74.2977 | 0.0345611 | 10703 | 385 | 0.04154 | GBM |
| StackedEnsemble_BestOfFamily_5_AutoML_2_20240217_150332 | 107.313 | 11516 | 78.6803 | 0.0304245 | 11516 | 1048 | 0.125501 | StackedEnsemble |
| GBM_grid_1_AutoML_2_20240217_150332_model_50 | 108.278 | 11724.1 | 70.7343 | 0.0353118 | 11724.1 | 584 | 0.033699 | GBM |
| GBM_grid_1_AutoML_2_20240217_150332_model_5 | 110.504 | 12211.1 | 75.2453 | 0.0338933 | 12211.1 | 425 | 0.059218 | GBM |
| StackedEnsemble_BestOfFamily_4_AutoML_2_20240217_150332 | 116.329 | 13532.4 | 76.6121 | 0.0309692 | 13532.4 | 133 | 0.051419 | StackedEnsemble |
| StackedEnsemble_AllModels_2_AutoML_2_20240217_150332 | 116.432 | 13556.4 | 81.7634 | 0.0365547 | 13556.4 | 289 | 0.068028 | StackedEnsemble |
| GBM_grid_1_AutoML_2_20240217_150332_model_44 | 117.215 | 13739.4 | 74.4341 | 0.0385557 | 13739.4 | 517 | 0.026806 | GBM |
| GBM_grid_1_AutoML_2_20240217_150332_model_29 | 117.541 | 13815.9 | 84.0961 | 0.0479065 | 13815.9 | 311 | 0.025232 | GBM |
| XGBoost_grid_1_AutoML_2_20240217_150332_model_74 | 118.446 | 14029.4 | 84.7568 | 0.0353555 | 14029.4 | 319 | 0.011452 | XGBoost |
| GBM_grid_1_AutoML_2_20240217_150332_model_2 | 118.505 | 14043.5 | 82.3569 | 0.0352881 | 14043.5 | 246 | 0.02131 | GBM |
| XGBoost_grid_1_AutoML_2_20240217_150332_model_48 | 118.583 | 14062 | 87.7628 | 0.0312422 | 14062 | 314 | 0.008866 | XGBoost |
| GBM_grid_1_AutoML_2_20240217_150332_model_42 | 118.592 | 14064.1 | 82.0764 | 0.0408402 | 14064.1 | 260 | 0.032528 | GBM |
| StackedEnsemble_AllModels_1_AutoML_2_20240217_150332 | 120.772 | 14585.8 | 85.5559 | 0.0387308 | 14585.8 | 151 | 0.074889 | StackedEnsemble |
| GBM_grid_1_AutoML_2_20240217_150332_model_47 | 123.506 | 15253.7 | 81.6336 | 0.0408862 | 15253.7 | 660 | 0.028998 | GBM |
| GBM_3_AutoML_2_20240217_150332 | 124.079 | 15395.5 | 89.441 | 0.0402024 | 15395.5 | 559 | 0.038064 | GBM |
| StackedEnsemble_BestOfFamily_3_AutoML_2_20240217_150332 | 124.367 | 15467.1 | 87.9817 | 0.0411647 | 15467.1 | 132 | 0.07249 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_2_20240217_150332 | 124.879 | 15594.7 | 88.3318 | 0.0405229 | 15594.7 | 128 | 0.052221 | StackedEnsemble |
[20 rows x 9 columns]
Residual Analysis
Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Variable Importance
The variable importance plot shows the relative importance of the most important variables in the model.
Variable Importance Heatmap
Variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types we compute a summarization of the the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
Model Correlation
This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.
SHAP Summary
SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Individual Conditional Expectation
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
In [20]:
# Explain a single H2O model (e.g. leader model from AutoML)
exm1 = aml1.leader.explain(test_hour)
Residual Analysis
Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Individual Conditional Expectation
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
In [21]:
pred2 = aml1.predict(test_day)
pred2
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
Out[21]:
| predict |
|---|
| 1356.52 |
| 1368.32 |
| 1490.08 |
| 1710.02 |
| 1598.88 |
| 1023.14 |
| 1447.29 |
| 2023.49 |
| 646.918 |
| 1998.5 |
[149 rows x 1 column]
In [21]: